-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gnomAD v4 short variants #1178
Conversation
6a86005
to
b778919
Compare
51cf678
to
aed81ab
Compare
604f868
to
ab01ef3
Compare
f52e1a5
to
620d360
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking quite cool, lots of great stuff that I'm excited to get acquainted with.
In general I'm pretty excited about the typing you've been adding, in addition to the making it far easier to create tests for pipelines locally with the dataset config.
In regards to the tooling:
- Poetry, I've had a few difficulties with this so far, likely some significant amount of this is user error and/or some quirk of my machine, but Poetry does not consistently activate a virtual environment with
poetry shell
for me, even to the point of telling me its active when runningwhich python
shows that is most certainly is not active. So that's just to say I'm glad that at least if I'm not messing with dependencies it still can output a genericrequirements.txt
file. I'm curious to hear Phil's thoughts on Poetry. - Ruff, I'm sold
- Black, I'm sold on shortening the line length. It appears as though 88 is their default value, so I'm on board with using their choice, as they're in the business of having opinions.
- Type checking with pyright + attrs, this syntax does seem nice to quickly define types.
- Loguru, I'm sold.
- I/O validation with types, yet again, I'm sold, seems like a one time big cost then a bit of upkeep for huge benefits
- Commiting Hail schemas to Source Control, this seems like it could have some nice benefits, what is the intended workflow with this? Is it to locally run the python script to generate the schemas, then check for a diff in between the old one and the new one, before running a pipeline? In that case do we remember to manually run this on our machine, and check these new ones into source control, then having some CI check for a difference a la automated Jest tests? Or is this more just to always be able to reference the old schema, I know I've spent enough time making a tiny script to run describe on a dataset because I don't have the schema handy
- Dataset Configs, I see how this is useful for being able to generate tests for entire pipelines. Would a small test for a given pipeline require replicating the entire pipeline in a test file (running all the same tasks in the same order) with a different dataset config that dictates a different input and output? Or is the benefit here being able to test the code for the actual pipeline at all, not what a particular pipeline does in its own logic?
In general, I haven't gotten my hands too dirty with these proposed changes in a firsthand way yet, but I'm excited about the suggestions and to learn from the knowledge you've gained from other projects.
Thank you for the feedback, Riley!
Interesting. Poetry isn't a hard requirement for me, but I do find it makes it significantly easier to keep dependencies up to date and to know exactly what versions are being used. I'm happy to remove this if people would like, in favor for a simpler
Basically yes I would see us having a shared "test dataset" in GCS somewhere that is downloaded locally with a pipeline step, the pipeline is executed, and then a diff is checked in CI. Ideally this is not strictly manual; the idea is to mimic Jest-style snapshots. And yes having a browsable schema in source is a pretty significant bonus, IMHO.
Not quite sure what is being asked here, but happy to chat in person to clarify. I think it's the later; having a systematic way to run a small dataset through the pipeline is useful for prototyping and making changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Addition of gnomAD v4 short variants
Major update to add support for v4 short variants to browser.
Pipeline changes
This PR includes some updates to the gnomAD pipeline tool chain, tests, and modifications to the pipeline configuration. There is also the start of the v4 variant/coverage pipeline.
Still a draft & not yet confirmed for breaking changes to other pipelines; but feel free to look and provide initial feedback
Tool chain changes
Dependency management with poetry
Previously we had only a
requirements.txt
Now dependencies installed with poetry, kept in
pyproject.toml
This creates a lock file with exact versions.
Then generate
requirements.txt
with:poetry export --with dev >requirements.txt
or by running
update-requirements.sh
UPDATE: you don't have to use poetry if you don't touch dependencies; just use
requirements.txt
.Linting
Pylint -> Ruff
https://docs.astral.sh/ruff/
Black
I haven't done this yet, but I'd suggest decreasing line length 120 -> 88?
Type checking & input validation
Classes are defined with
@attr.define
Example:
Advantages:
Nice editor experience!
More readable (compared to using dictionaries or having no types).
Use
create()
to instantiate classes instead of__init__
Use
structure
andunstructure
utility functions to:Logging
Suggestion use Loguru instead of std lib logging (see their website for the advantages).
Input/output validation
Using ChatGPT, hail schemas, and
attrs/cattrs
libraries, it is possible to quickly create classes that can be used to validate pipeline I/O.e.g.
Then validate:
Checks for missing data too.
Comitting hail schemas to source control
write_schemas.py
Iterates through all hail inputs/outputs for a given pipeline & writes them to
schemas
dir.e.g.
/schemas/gnomad_v4_variants/annotate_gnomad_v4_exome_transcript_consequences/output/gnomad_v4_variants_annotated_2.ht.schema
This is analogous to our jest snapshots. If anything in the pipeline changes, the schema will be tracked in source control. And it's nice to have as a reference.
https://github.com/broadinstitute/gnomad-browser/pull/1178/files#diff-0811404f3bc0e96d8cd34bae975ef717b89943056daace9d6a5c280515412762R41-R54
Modifications to core Pipeline class
Improved IO config
Previously, one would run a pipeline with an
--output-root
, a required arg.Data inputs were hardcoded, or relied on other pipelines that may not exist.
Makes it hard to develop pipelines in isolation.
The
Pipeline
class changed to make this an optional arg, accepts a config object that allows specifying both input & output dirs.A new
PipelineMock
class makes it possible to stub other pipeline outputs.This makes it possible to easily switch between local dev datasets & prod full datasets.